fix: Fixed running multiple workflows with mutex and memo concurrently is broken#11883
Merged
terrytangyuan merged 1 commit intoargoproj:masterfrom Sep 25, 2023
Merged
fix: Fixed running multiple workflows with mutex and memo concurrently is broken#11883terrytangyuan merged 1 commit intoargoproj:masterfrom
terrytangyuan merged 1 commit intoargoproj:masterfrom
Conversation
Fixes argoproj#11853 Signed-off-by: shmruin <meme_hm@naver.com>
agilgur5
approved these changes
Sep 24, 2023
agilgur5
left a comment
There was a problem hiding this comment.
Solid analysis and description! Makes sense to me.
At worst, this release is a redundant no-op, so very safe change as far as I can tell
terrytangyuan
approved these changes
Sep 25, 2023
Member
|
Thanks! |
|
Thanks for the explanation and the fix! Can this PR be picked for v3.4.12 #11851 |
Contributor
Author
|
@Paritosh-Anand I guess this will be automatically included to that release because of no conflict expected. I will keep checking that thread 😀 |
terrytangyuan
pushed a commit
that referenced
this pull request
Oct 19, 2023
…y is broken (#11883) Signed-off-by: shmruin <meme_hm@naver.com>
5 tasks
dpadhiar
pushed a commit
to dpadhiar/argo-workflows
that referenced
this pull request
May 9, 2024
…y is broken (argoproj#11883) Signed-off-by: shmruin <meme_hm@naver.com> Signed-off-by: Dillen Padhiar <dillen_padhiar@intuit.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #11853
Motivation
When running multiple workflows with mutex synchronization and cache at the same time, there is an issue that mutex key is not released.
The first workflow will run correctly and release the keys, but the second workflow's 1st step will run but it does not release the key.
So this results in infinite pending nodes (like second workflow's 2nd or later nodes) as described in the issue.
Modifications
Omitting releasing lock in fulfilled node check after memo in
executeTemplatemakes this problem.When checking fulfilled node, this also should release any locks that this node have.
In
executeTemplate, lock is released at first of the function or after running the details (like this or this)However, if memoization exists, node is returned as a short-cut, so currently releasing lock only depends on the first.
Though most nodes go through
executeTemplatemultiple times by their state, Omitting releasing lock in fulfilled node check after memo surley makes the problem.Like this issue, when running multiple nodes concurrently, second workflow's first node is not
fulfilledbut cache is hit, so it pass the first and fulfilled node check after memo does not release the lock. so the later nodes cannot get the lock forever.I just fix this issue by adding omitted releasing lock to fulfilled node check after memo
Verification
Simply can test this changes works by running the workflow below few times concurrently.
The fist workflows cache hit is 'NO' and 'YES' and seconds workflows cache hit is all 'YES' with no error as expected.